Scaling Record Linkage to Non-uniform Distributed Class Sizes

Authors

  • Steffen Rendle
  • Lars Schmidt-Thieme
Abstract

Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers to reduce the search space by discarding obviously different record pairs. In practice, important problems have Zipf-distributed class sizes, with some large classes for which blocking is no longer applicable. We therefore propose two novel meta-algorithms for scaling arbitrary record linkage models to such data sets. The first parallelizes a problem by creating overlapping subproblems; the second effectively reduces the search space for large classes. Our evaluation shows that both techniques are effective and can scale state-of-the-art models to challenging data sets.
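To make the role of a blocker concrete, here is a minimal Python sketch of key-based blocking. It is not the authors' method: the blocking key (the first three letters of a lowercased name), the helper names `block_key` and `candidate_pairs`, and the sample records are all assumptions chosen for illustration. Only records that share a key are paired, so most obviously different pairs are never compared.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def candidate_pairs(records, key=block_key):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    # Only pairs inside the same block become candidates.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    {"name": "Canon EOS 400D"},
    {"name": "canon eos 400 d"},
    {"name": "Nikon D80"},
    {"name": "Nikon D-80 body"},
    {"name": "Sony Alpha 100"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs), "candidates out of", all_pairs, "possible pairs")
```

A class of n truly matching records still contributes n(n-1)/2 pairs inside its block, because no blocker can discard pairs that really belong together. This is one way to read the abstract's remark that blocking stops helping for large classes.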


Similar resources

Experimental Study on Using Uniform Tuned Liquid Column Damper for Structural Control of Buildings Resting on Loose Soil

In this study, the efficiency of the Uniform Tuned Liquid Column Damper (UTLCD) in structures resting on loose soils was investigated through a series of shaking table tests and statistical analysis, taking soil-structure interaction into account. The soil beneath the structure is loose sandy soil. A Laminar Shear Box (LSB) was adopted as the soil container, and the scaled form of the prototype structure nam...


Record Range of Uniform Distribution

We consider a sequence of independent and identically distributed (iid) random variables with absolutely continuous distribution function F(x) and probability density function (pdf) f(x). Let Rnl be the largest observation after observing the nth record and R(ns) be the smallest observation after observing the nth record. We then call Wnr = Rnl − R(ns), n > 1, the nth record range. We will c...


Fully Distributed Modeling, Analysis and Simulation of an Improved Non-Uniform Traveling Wave Structure

Modeling and simulation of communication circuits at high frequencies are important challenges in the design and construction of these circuits. Because the lumped-element model is not valid at high frequencies, a distributed analysis based on active and passive transmission line theory is presented. In this paper, a lossy transmission line model of traveling wave switch (TWSW) i...


Buckling Analyses of Rectangular Plates Composed of Functionally Graded Materials by the New Version of DQ Method Subjected to Non-Uniform Distributed In-Plane Loading

In this paper, the new version of the differential quadrature method (DQM) is considered for calculating the buckling coefficient of rectangular plates. First, the differential equations governing the plates are obtained. Then, based on the new version of the differential quadrature method, the derivatives in the equations are converted to function values at the grid points inside...


The Discrimination Power of Dependency Structures in Record Linkage (Bureau of the Census, Statistical Research Division, Research Report Series No. RR-92108)

A record-linkage process brings together records from two files into pairs of two records, one from each file, for the purpose of comparison. Each record represents an individual. The status of the pair is a “matched pair” status if the two records in the pair represent the same individual. The status is an “unmatched pair” status if the two records do not represent the same individual. The rec...



Publication date: 2008